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Abstract 

In many applications, it is of interest to approximate data, given bymxn 
matrix A, by a matrix B of at most rank k, which is much smaller than m and 
n. The best approximation is given by singular value decomposition, which is 
too time consuming for very large m and n. 

We present here a Monte Carlo algorithm for iteratively computing a fc-rank 
approximation to the data consisting of m x n matrix A. Each iteration involves 
the reading of 0(k) of columns or rows of A. The complexity of our algorithm 
is O(kmn). Our algorithm, distinguished from other known algorithms, guar- 
antees that each iteration is a better fc-rank approximation than the previous 
iteration. We believe that this algorithm will have many applications in data 
mining, data storage and data analysis. 

1 Introduction 

In many applied settings, dealing with a very large data set, it is important to 
reduce the size of data in order to make an inference about the features of data set, 
in a timely manner. Assume that the data set is represented by an m x n matrix 
A € W nxn . It is important to find an approximation B £ K mxn of a specified rank 
k to A, where k is much smaller than m and n. 

Here are several motivation to obtain such B. First, the storage space needed 
for B is k(m + n), which is much smaller than the storage space mn needed for A. 
Indeed, B can be represented as 

J B = x 1 yT +X2 yJ + ...x fc y^, (1.1) 

where Xi , . . . , x& and yi , . . . , y& are k column vectors with m and n coordinates 
respectively. We store the vectors xi, . . . , x&, y 1; . . . , y^, which need the storage 
k(m + n), and compute the entries of B, when needed, using the above expression 
of B. 

The second most common application is clustering algorithms as in [1], [2], [3], [4] 
and [9]. Assume that our data represents n points and each point has m coordinates. 
That is each point represented by a column of the matrix A. We want to cluster the 
points in such a way that the distance between points in the same cluster is much 
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smaller than the distance between any two points from different clusters. One way 
to do this is to project all the point on the k main orthonormal directions encoded 
by the first k left singular vectors of A. Then cluster using either k one dimensional 
subspaces or the whole k dimensional subspace. The /c-rank approximation B gives 
the approximation to this k dimensional subspace and the approximation to the first 
k singular vectors. Another way to cluster is to use the projective clustering. It is 
known that a fast SVD, i.e. fast fc-rank approximation, is a main tool in this area 
[2] and [3]. 

The third application is in DNA microarrays, in particular in data imputation [6]. 
Let A be the gene expression matrix. The m rows of the matrix A are indexed by the 
genes, while the n columns are indexed by the number of experiments. It is known 
that the effective rank of A is k, is usually much less then n. The SVD decomposition 
is the main ingredient of the FRAA, {fixed rank approximation algorithm), which 
successfully implemented in [6] to impute the corrupted entries of A. The fast k- 
approximation algorithm suggested in this paper, combined with the clustering of 
similar genes, can be implemented to improve the FRAA algorithm. 

The best approximation B, which minimizes the Frobenius norm \ \A — B\\^ := 
trace (A - B) T (A - B), is given by the singular value decomposition (SVD) of A. 
However SVD decomposition needs 0(mn ■ min(m, n)) time computation, which is 
often too prohibitive. 

One way to find a fast fc-rank approximation is to choose at random I > k 
columns or rows of A and obtain from them Ai-rank approximations of A. This is 
basically the spirit of the algorithm suggested in [7]. We call this algorithm the 
FKV algorithm. Assuming a statistical model for the distribution of the entries of 
A the authors give some error bounds on their fc-rank approximation. The weak 
point of FKV algorithm is its inability to improve iteratively FKV approximation 
by incorporating additional parts of A. In fact in the recent paper [3], which uses 
FKV random algorithm, this point is mentioned specifically in the end of the paper: 
it would be interesting to design an algorithm that improves this approximation 
by accessing A (or parts of A) again." 

The aim of this paper is to provide a sampling framework for iterative updates 
of k-rank approximations of A, by reading iteratively additional columns or rows 
of A, which improves for sure the approximation B each time it is updated. The 
quality of the approximation of B is given by the Frobenius norm ||-B||f ; which 
is a nondecreasing sequence under these iterations. The rate of increase of the 
norms ||-B||f can be used as a stopping rule for terminating the algorithm. Also 
the updating algorithm of /c-rank approximation gives approximation to the first k 
singular values of A, and the approximations to the first k left and right singular 
vectors of A. 

Assuming that the number of columns or rows which are read is I = O(k), the 
complexity of our algorithm is 0(kmn). The intensive part of the computations is 
devoted to obtain the spectral decompositions of (k + I) x (k + I) real symmetric 
matrices, where the computation for each decomposition is of order 0(k 3 ). The 
O(kmn) part of the algorithm is due to the multiplications of k column vectors, 
which approximate the k left or right singular eigenvectors of A, by A T or A respec- 
tively. This part of the algorithm can be parallelized, which will speed significantly 
the algorithm suggested here. The simulations that we performed show that we 
need a small number of iterations, (around 5), to get a very good approximation to 
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the best /c-rank approximation of A. 

We believe that this algorithm will have many applications in data mining, data 
storage and data analysis. 



2 SVD 

In this section, we recall some basic facts about SVD, which are embedded in our 
algorithm, see [8]. Let R m ,M mxn , O m d(R), S n (R), be the linear space of real column 
vectors with m coordinates, the linear space of real m x n matrices, the subset of 
m x d real valued matrices whose d (< m) columms is an orthonormal system and 
the subspace of n x n real symmetric matrices. For S G S n (R) we let S > if S is 
nonnegative definite and S > if S is positive definite. Denote by diag(di, . . . , d m ) G 
l mxm the matrix with the diagonal entry di on the position for i = l,...,m 
and all other entries are equal to zero. Let r = rank^4 be the rank of A. Then the 
SVD decomposition of A is given as A = U r Yi r Vj where UjU r = V?V r = I r , and 
S r = diag((Ti, cr r ), o\ > ... > a r > 0. Let Ui, ■ ■ ■ ,u r € M. m and vi, . . . , v r G W l 
denote 1, . . . ,r orthonormal columns of U and V respectively. Then A = U r T, r V^F 
can be written as A = Ylq=i (J q u q v 'q ■ The vectors Uj,Vj are called the left and 
right singular vectors of A respectively, which correspond to the singular value <7j. 
The left and the right singular vectors can be computed from the right and the left 
singular vectors by the formulas: 

^ = — Avi, Vi = —A T Ui, i = l, ...,r. (2.1) 

Equivalently, a\ > . . . > of > are all the positive eigenvalues of the nonnega- 
tive definite symmetric matrices AA^jA^A, with the corresponding orthonormal 
eigenvectors ui, ■ ■ ■ , u r € M m and vi, . . . , v r G W 1 . 

Denote by ||A||f := Vtrace A T A = Vtrace AA T the Frobenius (£2) norm of 
A. It is the Euclidean norm of A viewed as a vector with nm coordinates. Each 
term UgV^F in SVD decomposition of A is a rank one matrix with ||u ? vJ||f = 1. Let 
lZ(n, m, k) denote the set ofnxm matrices of at most rank k (min(m, n) > k). Then 
for each k, k < r, the SVD of A gives the solution to the following approximation 
problem: 



min \\A - B\\ F = \\A -y^cjqUqV 1 

B&n(n,m,k) 



\ 



q=k+l 



If <7fc > cjfc+i then Ylq=i a q VL q^ v 'q ls the unique solution to the above minima 
problem. For the purposes of this paper, it will be convenient to assume that a q = 
for any q > rank A. 

Given k G [1, r] then af, . . . ,a^, ui, . . . , and Vi, . . . , Vfc are characterized by 
the following maximal characterization: 

Theorem 2.1 Let A G IR™^ and k G [1, min(m, n)]. Let xi,...,Xfc and 
yi, . . . ,yk be two sets of orthonormal vectors in W 71 and W 1 respectively. Then 

i=l i=l i=l i=l 
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Equality in the first or second inequality occurs i/span(xi, . . . ,x&) or span(yi, . . . ,yk), 
contains k linearly independent left or right singular vectors of A, corresponding the 
the k maximal singular values of A respectively. 

The above characterization follows from the maximal, (Ky Fan characterization), 
of the sum of the first biggest eigenvalues of a real symmetric matrix: 

Theorem 2.2 Let S G 5 P (M) be a real p x p symmetric matrix. Let Ai > 
... > A p be the eigenvalues S arranged in a decreasing order and listed with their 
multiplicities. Let Wi,...,w p G W be an orthonormal set of the corresponding 
eigenvectors of S: Swj = AjWj,i = l,...,p. Let k G [l,p] be an integer. Then for 
any orthonormal set xi, . . . , x^ G MP 

k k 

i=l i=l i=l 

Equality holds if and only if the subspace span(xi, . . . , x^) contains k linearly inde- 
pendent eigenvectors of S corresponding to the eigenvalues Ai,. . . ,A&. 

See for example [5] for proofs and the references. Note that xi, . . . , xj. G M m is 
a system of orthonormal vectors in M. m if and only if the m x k matrix (xi, . . . , x&) 
is in O mfc (R). 

To obtain Theorem 2.1 from Theorem 2.2 we let p = m, (or p = n), and S to be 
equal to AA T , (or A^A). In (2.3) we emphasized the complexity of the computations 
of the left-hand side of the inequalities. See also [6] for applications of Theorem 2.2 
to data imputation in DNA microarrays. 

In the rest of the paper we give the version of our results to fc-orthornormal sys- 
tems xi, . . . , Xfc G M m . Similar results holds for fc-orthonormal systems yi, . . . , y& G 
R n . 

Corollary 2.3 Let A G W nxn and k G [l,min(m, n)\ be an integer. Then for 
any k-orthornormal system xi, . . . ,x^ G W n the following equality holds: 

k k 

\\A-^ i ^A)\\l = \\A\\l-^{A^{A^ i ). (2.5) 

i=l i=l 

In particular the best k-rank approximation of A is given by Yli=i u i(. u 7A), where 
Ui, . . . , Ufc G W n is an orthonormal sets of the left singular vectors of A correspond- 
ing to 01, . . . ,0-fc. 

The next theorem is the key theorem for updating the fc-rank approximation for 
Y,i=i x i( x T A )> for some ( x i> • • • ,x fe ) G O mfc (R). 

Theorem 2.4 Let xi,...,Xfc G M m , be an orthonormal system in W 11 . Let 
Wi,...,W; G M m , be a given set in M m . Perform the modified Gram-Schmidt al- 
gorithm on xi, ... ,Xfc,wi, ... , to obtain an orthonormal set xi,. . . ,x p G M. m , 
where k < p < k + I. Assume that k < p, i.e. span(wi, . . . , w;) ^ span(xi, . . . , x^). 
Form pxp real symmetric matrix S := ((A T Xj) T (.A T Xj))? . =1 , and assume that X\ > 
. . . > Afc are the k-largest eigenvalues of S with the corresponding k-orthonormal 
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vectors oi, . . . , G M p . Let O := (oi, . . . , 0&) G O p fc(M) and define k-orthonomal 
vectors x±, . . . , X& G M m as follows: 

(xi, . . . ,x fc ) = (xi, . . . ,Xp)0. (2.6) 

T/ien 

fc fc 
^(A T Xl ) T (A T Xl ) < ^(A T Xl ) T (^ T Xl ). (2.7) 

i=l i=l 

Ftiri/iermore 

Aj = (A T ^ i ) T (A T ^),i = l,...,k. (2.8) 

We now explain the essence of Theorem 2.4. View xi, . . . , x& as approximation 
to the first fc-left singular vectors of A, and J2i=i x «( j 4 Tx i) T as the &-rank approxi- 
mation to A. Hence (^4 T Xj) T (j4 T Xi) is an approximation to <jf of A for i = 1, . . . , k. 
Read additional vectors wi, . . . ,wj G M m such that at least one of this vectors is 
not in the subspace spanned by xi,...,Xfc. Let X be the subspace spanned by 
xi, . . . ,xj. and wi, . . . , w;. Hence xi, . . . , x&, . . . , x p is the orthonormal basis of X 
obtained from the vectors xi, . . . , x^, wi, . . . , W/, using the modified Gram-Schmidt 
algorithm. (The modified Gram-Schmidt algorithm used to ensure the numerical 
stability.) Note that k < p < k + I, and in general one has that p = k + I. Consider 
the p x p nonnegative definite matrix S = ((A T Xj) T ( J 4 T Xj))^ =1 . Find its first k 

eigenvectors to obtain xi, . . . G R m using (2.6). Then C := Y^lt=i x i(^- Tx i) T is 
the best approximation of A by matrix B of rank k at most, whose columns are 
in the subspace X. In particular, the approximation C is better than the previous 
approximation Yli=i x «( j 4 Tx «) T ) which is equivalent to the (2.7). 

Outline of Proof of Theorem 2.4. Let S = (sij)^j =v Let ej = (8n, . . . ,Si p ) T ,i = 

1, . . . ,p be the standard orthonormal basis in W. Use the definition of A and Ky 
Fan characterization of the sum of the maximal k eigenvalues of symmetric S to 
deduce 

k k k k 

Y J (A T ^ i ) T (A T ^ t ) = J2 e JSei < = J2°JSo l . 

i=l i=l i=l i=l 

Let C := A T (xi, . . . ,x p ). Then S = C T C. Hence the ^ > ... > y% arc 
the singular values of C and oi, . . . ,o p are the right singular vectors of C. Thus 
Coj = \f\it{ G W 1 , where tj is the left singular vector of C corresponding to 
the singular value s/Xi for % = 1, . . . ,p. A straightforward calculation shows that 
Xi = oJSoi = (yl T Xi) T (^4 T Xj) for i = 1, . . . , k. Hence (2.7) holds. Furthermore we 
also deduced the equalities in (2.8) for i = 1, . . . , k. □ 



3 Algorithm 

One starts the algorithm by choosing the first fc-rank approximation to A as follows. 
Let c i> • • • j c n G M. m be the n columns of A. Choose k integers 1 < n\ < . . . < rik < 
n. Let xi, . . . , x ? G M m be the orthonormal set obtained from c ni , . . . , c„ fc using the 
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modified Gram- Schmidt algorithm. Set 

q 

B := J>(A T x i ) T . (3.9) 

i=i 

In general q = k, but in some cases if xi,...,x& are linearly dependent, q < k. 
Assume for simplicity of the exposition that q = k. Then Bq is of the form (1.1) 
where y« = A Xj, i = 1, . . . , k. 

Now update iteratively /c-rank approximation of B t ~\ of A to , using Theorem 
2.4, by letting wi := Cj 1: . . . ,wj := Cj ; G M m , for some / integers 1 < ji < . . . < 
jl < n - That is, we choose another I sets of columns of A, preferably that were not 
chosen before, and update the /c-rank approximation using the algorithm suggested 
by Theorem 2.4 to obtain an improved /c-rank approximation Bt of A. 

Furthermore one can use the /c-rank approximation B t from the above algo- 
rithm to approximate the first /c-singular values u\ , . . . , cr^, and the left and the 
right singular vectors m, . . . ,Ufc and vi, . . . , as follows. First, the square roots 
y/ Ai(5), . . . , \f \k{S) of the matrix S are approximations for a±, . . . , a^. Second, 
the vectors xi, . . . , x& approximate ui, . . . , u^. Let v, := A T Xj, i = l,...,k. Then 
||vi|| = \J Xi{S) for i = l,...,k. Third, the renormalized Vj which are given as 
U^TjyVj approximate the right singular eigenvectors Vj for i = 1, . . . , k. 



Fast A>rank approximation and SVD algorithm 

Input: positive integers m, n, k,l,N, m x n matrix A, e > 0. 

Output: anmxn /c-rank approximation Bf of A, with the ratios jj^jj and ^g^y 

approximations to /c-singular values and k left and right singular vectors of A. 

1. Choose /c-rank approximation Bq using k columns, (or rows), of A. 

2. for t = 1 to N 

- Select I columns, (or rows), from A at random and update B t _\ to B t . 

- Compute the approximations to /c-singular values, and k left and right 
singular vectors of A. 

- If > 1 - e let / = t and finish. 



We now explain briefly the main steps of our algorithm. We read the dimensions 
m, n of the data matrix A. We set as the maximal number of iterations we are 
going to execute to find the /c-rank approximation of A. We read the entries of 
the data matrix A, and finally the small parameter e > 0. We choose the /c-rank 
approximation Bq using (3.9). Assume that B t -\ is the current /c-rank approxima- 
tion to A. Next we choose additional I columns of A and update B t -\ to B t using 
Theorem 2.4 as explained in the previous section. Recall that ||5f_i|| < ||-£>t||. If 
the relative improvement in ||-B t || is less than e, i.e. ^p^jp > 1 — e, we are satisfied 
with the approximation B t and finish our algorithm. If this does not happen then 
our algorithm stops after the N iteration. 



4 Complexity of the algorithm 

We assume here that m > n and we consider our algorithm applied to the randomly 
selected columns of A. (Otherwise we apply our algorithm to the rows of A.) The 
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first step of our algorithm is to find an orthonormal Xi , . . . , x g basis of the sub- 
space spanned by k randomly chosen columns of A. The modified Gram-Schmidt 
algorithm needs k 2 m flops assumed the worst, (generic), case q = k. This step 
needs 0(k 2 m) flops [8]. Let y$ := A T x.i 6 R n for i = 1, . . . , k. The computations 
of yi, • • • , yfc needs kmn operations and can be parallelized. Bo = Yli=i an d 
does not have to be computed. 

To find B\ we choose I columns wi, . . . , W; of A at random. Then we perform the 
modified Gram-Schmidt algorithm on xi, . . . , x&, wi, . . . , w; to find an orthonormal 
system xi,...,x p , with p £ [k,k + I]. This step requires m(k + I) 2 flops. Let 
yi = yl T Xj for i = k + 1, . . . ,p. This step needs at most Imn flops and can be 
parallelized. Next we have the following two choices. First possibility is to find the 
SVD of C\ = [yi, . . . , y p ]. This needs at most 0{(k+l) 2 n) flops. Second possibility is 
to compute the matrix S\ = CjCi given in Theorem 2.4. This needs nl(l + 2k + l)/2 
flops. Then we compute the spectral decomposition of S± which needs 0((k + I) 3 ) 
flops. If k + 1 is much smaller than n, it seems to us that the second method is more 
efficient. 

The first k right singular vectors of C\ are identical to the first k eigenvectors 
of S. Use these k vectors, as explained in Theorem 2.4, to update xi, . . . ,x& and 
yi, . . • , y&. This needs kp(m + n) flops. 

To update B t -\ to B t for t > 1 require the same number of flops as to compute 
B\. Hence, the most intensive part of the computation lies in the computation of 
the spectral decomposition St, which is 0((k + Z) 3 ). Assuming that I = 0(k), we 
deduce that the intensive part of the computation is of order 0(k 3 ). The simulations 
that we performed show that we need a small number of iterations, (around 5), to 
get a very good approximation to the best fc-rank approximation of A. Hence our 
algorithm is expected to converge in O(kmn) steps. 

5 Simulation results 

To assess the performance of our k-rank approximation algorithm we conducted 
different simulation on synthetic data and images. We also implemented our al- 
gorithm for three different sampling methods, uniform sampling with replacement, 
uniform sampling without replacement and weighted sampling for images accord- 
ing to weight of each row in gradient image. To show our algorithm guarantee 
convergence to the optimum (deterministic) k-rank approximation we applied the 
algorithm on a randomly generated data matrix of 3000 x 500 with rank 500. To 
measure the approximation error we defined the relative error of the approxima- 
tion as, || A — B t ||p / || A || F , where A is the original data matrix and B t is 
k-rank approximation to A. Figure 1 shows convergence of the relative error to the 
optimum relative error , \\A — Ai~\\p / \\A\\p where A^ is the optimum true k-rank 
approximation to matrix A, as a function of iteration parameter N for k = 100 on 
the synthetic data matrix. In each iteration we randomly picked I = 100 rows of the 
data matrix with and without replacement. For comparison the optimum relative 
error has been pointed out on the axis. As we expected the convergence property 
is faster when the rows are sampled without replacement. Figure 2 shows the same 
convergence result for the real images of Cameraman and Liftingbody (from Image 
Processing Toolbox of Matlab). Figure 3 shows the plots of the relative error versus 
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Table 1: Comparison of relative error and speed up of our algorithm with optimum 
fc-rank approximation algorithm 



Data sets 


Speed up 


Re. ratio 


Camcraman(256 x 256), k — 80 


1.145 


1.083 


Liftingbody (512 x 512), k = 100 


8 


1.08 


Map image(627 x 865) k = 200 


3.33 


1.067 


Random matrix(8000 x 200) k = 100 


42 


1.1 



total number of sampled rows in each method of sampling for the Cameraman and 
Liftingbody images. It can be seen that for these images our algorithm return very 
reasonable relative error only by using portion of the data. Needless to say that the 
importance of this iterative algorithm can be observed when it is applied to the big 
matrices of data where the SVD of the data matrix is often impossible to compute. 
In this case the simple randomized SVD methods [7] may not be fast, since it needs 
to sample large enough number of rows to give the acceptable error bound on the 
approximation. To highlight this feature of our algorithm we applied our algorithm 
on a synthetic full rank data matrix of 8000 x 200. We chose the parameter I, N 
such that the relative error to be less than 2 times of optimum relative error. We 
observed that the speed up of our algorithm, which is the rate of the time needed 
by deterministic SVD to compute 100-rank approximation to the time needed by 
our algorithm to compute Bt, is 42 in our system. The result for three more data 
sets with different values of k has also been illustrated in Table 1. Due to system 
limitation and the comparison issue we were not able to show the speed up for bigger 
matrices where deterministic SVD (svd in Matlab) may fail to compute the svd of 
the matrix. However, our algorithm can always compute the k-rank approximation 
of the matrix as long as k + I is not too big since the algorithm in each iteration 
deals with small matrices. 

6 Conclusions 

We have proposed a novel approach for fast computing of fc-rank approximation 
of a given m x n data matrix, using Monte-Carlo method by choosing at random 
I columns or rows of A. The advantage of our algorithm is that we guarantee 
that every iteration improves the quality of our approximation. We applied our 
algorithm on synthetic data matrices as well as images and the result confirms the 
convergence of the relative error of the approximation to the optimum relative error. 
To highlight the important feature of this algorithm we applied this method on a 
big matrix of randomly generated data and we observed for the reasonable level of 
relative error the algorithm is also much faster than optimum k-rank approximation 
using deterministic SVD which may also fail to compute the svd for big matrices. We 
believe that this algorithm will have many applications in data mining, data storage 
and data analysis where dealing with high dimensional data is a major problem. 
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- Uniform sampling without replacement 

- Uniform sampling with replacement 



5 10 
Number of iteration 



Figure 1: Convergence property of the Monte-Carlo method for random data 
matrix(3000 x 500). 



- Uniform sampling without replacement 
-Weighted sampling 

- Uniform sampling wifh replacement 




- Weighted sampling 

- Uniform sampling with replacement 

- Uniform sampling without replacement 



Number of iteration 



Number of iteration 



(a) Convergence property of the Monte-Carlo 
method for Cameraman image(256 x 256), 
k = 80. 



(b) Convergence property of the Monte- 
Carlo method for Liftingbody image(512 x 
512), k = 80. 



Figure 2: Cameraman and Liftingbody 



9 




(a) Cameraman: Relative error versus total (b) Liftingbody: Relative error versus total 

number of sampled rows, k = 80. number of sampled rows, k = 80. 

Figure 3: Cameraman and Liftingbody relative errors 
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